How much does a Ford Focus cost? - by Andras Somi

Getting the data

For this project I downloaded and cleaned the data of more than 2500 listings of used Ford Focus cars from the biggest Hungarian online used car listing site, Hasznaltauto.hu. I chose specificaly the Ford Focus for this task because:

  • it’s one of the most popular models on this market in it’s category,
  • it has a history of almost 20 years therefore quite a few versions are out there,
  • it offers a wide range of technical and aesthetic varieties.

So I hope the dataset is diverse enough to offer some interesting insights. My major questions are quite obviuous:

  1. What factors affect the price of a used Ford Focus the most on the (Hungarian) market?
  2. Is it possible to predict the ‘fair’ value of a certain car, therefore find pricing anomalies on the market?

Data wrangling

I included the Python scripts for scraping and cleaning the data in the data-scraper directory. I used pandas, requests, BeautifulSoup and pymongo for MongoDB as an itermediary data store.

The main steps were:

  1. I collected the basic details of the cars, not including most of the available text fields (eg. special technical and comfort specifications) and the longer descriptions from the sellers (though these could also hold valuable data for later analysis).
  2. I dropped those fields thatdidn’t contain enough observations and also filtered out those listings where no price (in euro) was available.
  3. I translated column names and categorical values from Hungarian to English, and made sure that the final dataset contains no directly identifiable data (eg. title, url or listing id).

(Note: The data was obtained on July, 27 and was not updated since then. This sample is only a tiny fraction of the whole database of Hasznaltauto.hu, and not intended for any purposes other than this learning project. I am not affiliated to Hasznaltauto.hu in any form.)

The dataset

## 'data.frame':    2552 obs. of  26 variables:
##  $ Doors                : num  5 5 5 5 5 5 4 5 5 5 ...
##  $ C.Max_variant        : Factor w/ 2 levels "False","True": 1 1 2 1 1 1 1 1 1 1 ...
##  $ Trunk.capacity.liters: num  520 385 NA NA NA NA 490 520 482 482 ...
##  $ Drive                : Factor w/ 2 levels "All-wheel","Front-wheel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Capacity             : num  1596 1560 1560 1796 1753 ...
##  $ Mileage.km           : num  192000 205000 210599 182146 175117 ...
##  $ Form                 : Factor w/ 7 levels "Cabrio","Combi",..: 2 3 4 2 2 3 6 2 2 2 ...
##  $ A.C.type             : Factor w/ 3 levels "Automatic A/C",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Documents.valid      : Date, format: NA "2018-01-31" ...
##  $ Own.weight.kg        : num  1135 1352 NA NA NA ...
##  $ Horsepower           : num  101 90 109 116 116 90 101 101 101 101 ...
##  $ Condition            : Factor w/ 14 levels "Damaged","Damaged front",..: 10 10 10 10 10 10 12 11 10 10 ...
##  $ Price.EUR            : int  2678 2453 2714 2780 2780 2780 2816 2910 2940 2944 ...
##  $ Total.weight.kg      : num  1665 1790 NA NA NA ...
##  $ Fuel                 : Factor w/ 5 levels "Diesel","Electric",..: 4 1 1 4 1 1 4 4 4 4 ...
##  $ Manufactured.year    : int  2003 2005 2003 2002 2007 2005 2004 2003 2007 2003 ...
##  $ Manufactured.month   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Transmission.type    : Factor w/ 2 levels "Automatic","Manual": 1 2 2 2 2 2 2 2 2 2 ...
##  $ Gears                : num  4 5 5 5 5 5 5 5 5 5 ...
##  $ Documents            : Factor w/ 7 levels "No documents",..: 7 6 6 6 6 6 6 6 6 6 ...
##  $ Paint                : Factor w/ 19 levels "Beige","Black",..: NA 3 7 15 15 18 5 15 18 NA ...
##  $ Shade                : Factor w/ 3 levels "Dark","Light",..: NA 1 3 3 3 3 3 3 3 NA ...
##  $ Metallic             : Factor w/ 2 levels "False","True": NA 1 1 1 1 1 2 2 1 NA ...
##  $ Undamaged            : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ Age.year             : num  14 12 14 15 10 12 13 14 10 14 ...
##  $ Mileage.year         : num  13714 17083 15043 12143 17512 ...
##      Doors       C.Max_variant Trunk.capacity.liters         Drive     
##  Min.   :2.000   False:2335    Min.   :277           All-wheel  :  15  
##  1st Qu.:5.000   True : 217    1st Qu.:385           Front-wheel:2515  
##  Median :5.000                 Median :476           NA's       :  22  
##  Mean   :4.823                 Mean   :436                             
##  3rd Qu.:5.000                 3rd Qu.:482                             
##  Max.   :5.000                 Max.   :534                             
##  NA's   :20                    NA's   :106                             
##     Capacity      Mileage.km            Form               A.C.type   
##  Min.   : 138   Min.   :     1   Cabrio   :  18   Automatic A/C: 150  
##  1st Qu.:1560   1st Qu.:121660   Combi    :1144   Digital A/C  : 856  
##  Median :1596   Median :169662   Hatchback: 920   Manual A/C   :1447  
##  Mean   :1613   Mean   :163311   Minivan  : 217   NA's         :  99  
##  3rd Qu.:1753   3rd Qu.:211581   Notchback:   9                       
##  Max.   :2522   Max.   :461000   Sedan    : 242                       
##  NA's   :9      NA's   :71       Sport    :   2                       
##  Documents.valid      Own.weight.kg    Horsepower        Condition   
##  Min.   :2015-07-31   Min.   :1050   Min.   : 75.0   Normal   :1082  
##  1st Qu.:2018-03-31   1st Qu.:1220   1st Qu.: 95.0   Excellent: 658  
##  Median :2018-10-31   Median :1307   Median :109.0   Spared   : 534  
##  Mean   :2018-10-25   Mean   :1295   Mean   :109.2   Recent   : 114  
##  3rd Qu.:2019-05-31   3rd Qu.:1386   3rd Qu.:116.0   Undamaged:  89  
##  Max.   :2022-05-31   Max.   :1760   Max.   :351.0   Damaged  :  30  
##  NA's   :586          NA's   :82     NA's   :5       (Other)  :  45  
##    Price.EUR     Total.weight.kg           Fuel      Manufactured.year
##  Min.   :  324   Min.   : 187    Diesel      :1357   Min.   :1995     
##  1st Qu.: 2613   1st Qu.:1720    Electric    :   5   1st Qu.:2004     
##  Median : 4248   Median :1825    Ethanol     :   3   Median :2007     
##  Mean   : 6000   Mean   :1800    Gasoline    :1171   Mean   :2008     
##  3rd Qu.: 7822   3rd Qu.:1900    Gasoline/LPG:  15   3rd Qu.:2012     
##  Max.   :46969   Max.   :2180    NA's        :   1   Max.   :2017     
##                  NA's   :83                                           
##  Manufactured.month Transmission.type     Gears     
##  Min.   :10.00      Automatic:  72    Min.   :4.00  
##  1st Qu.:10.00      Manual   :2468    1st Qu.:5.00  
##  Median :11.00      NA's     :  12    Median :5.00  
##  Mean   :10.99                        Mean   :5.22  
##  3rd Qu.:12.00                        3rd Qu.:5.00  
##  Max.   :12.00                        Max.   :7.00  
##  NA's   :2026                         NA's   :42    
##                     Documents        Paint         Shade       Metallic   
##  No documents            : 150   Silver :434   Dark   : 316   False: 836  
##  Overdue, domestic       :  39   Blue   :420   Light  : 149   True :1233  
##  Overdue, foreign        :  57   Gray   :388   Regular:1604   NA's : 483  
##  Temp.whitdrawn, domestic:  59   Black  :292   NA's   : 483               
##  Temp.whitdrawn, foreign :  32   White  :241                              
##  Valid, domestic         :2121   (Other):294                              
##  Valid, foreign          :  94   NA's   :483                              
##  Undamaged          Age.year       Mileage.year      
##  Mode :logical   Min.   : 0.000   Min.   :    0.056  
##  FALSE:75        1st Qu.: 5.000   1st Qu.:12984.615  
##  TRUE :2477      Median :10.000   Median :16785.714  
##                  Mean   : 9.495   Mean   :      Inf  
##                  3rd Qu.:13.000   3rd Qu.:25331.125  
##                  Max.   :22.000   Max.   :      Inf  
##                                   NA's   :71

Univariate Plots Section

Price

Our main target variable in the analysis is the price. The distribution is positively skewed, the long tail reaches up to well over EUR 40 000, while the peak of the distribution is around only EUR 3000, with some observation as low as few hundred euros (or even lower). There’s also a little secondary peak at around EUR 14 000.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     324    2613    4248    6000    7822   46969

The long tail contains many outliers as the boxplot shows.

The very low price points might belong to damaged cars. The dataset contains descriptions for the condition of the cars, from which we can construct a boolean variable for indicating whether according to the seller the car is not damaged (eg. the condition is either ‘Excellent’, ‘Normal’, ‘Undamaged’, ‘Spared’ or ‘Recent’).

From 2552 observation only 75 is labeled as damaged, and majority of it is indeed offered for sale under EUR 1000. It might be useful to filter these out later.

Transforming the long-tail data (excluding damaged cars) reveals a more (but not completely) symmetrical distribution.

Mileage

The other important continous variable is the number of kilometers each car run in their lifetime.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1  121660  169662  163311  211581  461000      71

The shape of the distribution (excluding damaged cars) is fairly symmetrical, except the huge bar around 0.

This is due to cars in the dataset that are either new (some dealerships do advertise new cars on the site) or incorrectly stated as having run 1 or very few kilometers.

I consider a car ‘unusued’ if it ran less than 1000km. To remove these and the damaged cars we can construct a subset of the car_data dataframe called undamaged_used_cars.

## 'data.frame':    2211 obs. of  29 variables:
##  $ Doors                : num  5 5 5 5 5 5 4 5 5 5 ...
##  $ C.Max_variant        : Factor w/ 2 levels "False","True": 1 1 2 1 1 1 1 1 1 1 ...
##  $ Trunk.capacity.liters: num  520 385 NA NA NA NA 490 520 482 482 ...
##  $ Drive                : Factor w/ 2 levels "All-wheel","Front-wheel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Capacity             : num  1596 1560 1560 1796 1753 ...
##  $ Mileage.km           : num  192000 205000 210599 182146 175117 ...
##  $ Form                 : Factor w/ 7 levels "Cabrio","Combi",..: 2 3 4 2 2 3 6 2 2 2 ...
##  $ A.C.type             : Factor w/ 3 levels "Automatic A/C",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Documents.valid      : Date, format: NA "2018-01-31" ...
##  $ Own.weight.kg        : num  1135 1352 NA NA NA ...
##  $ Horsepower           : num  101 90 109 116 116 90 101 101 101 101 ...
##  $ Condition            : Factor w/ 14 levels "Damaged","Damaged front",..: 10 10 10 10 10 10 12 11 10 10 ...
##  $ Price.EUR            : int  2678 2453 2714 2780 2780 2780 2816 2910 2940 2944 ...
##  $ Total.weight.kg      : num  1665 1790 NA NA NA ...
##  $ Fuel                 : chr  "Gasoline" "Diesel" "Diesel" "Gasoline" ...
##  $ Manufactured.year    : int  2003 2005 2003 2002 2007 2005 2004 2003 2007 2003 ...
##  $ Manufactured.month   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Transmission.type    : Factor w/ 2 levels "Automatic","Manual": 1 2 2 2 2 2 2 2 2 2 ...
##  $ Gears                : num  4 5 5 5 5 5 5 5 5 5 ...
##  $ Documents            : Factor w/ 7 levels "No documents",..: 7 6 6 6 6 6 6 6 6 6 ...
##  $ Paint                : Factor w/ 19 levels "Beige","Black",..: NA 3 7 15 15 18 5 15 18 NA ...
##  $ Shade                : Factor w/ 3 levels "Dark","Light",..: NA 1 3 3 3 3 3 3 3 NA ...
##  $ Metallic             : Factor w/ 2 levels "False","True": NA 1 1 1 1 1 2 2 1 NA ...
##  $ Undamaged            : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ Age.year             : num  14 12 14 15 10 12 13 14 10 14 ...
##  $ Mileage.year         : num  13714 17083 15043 12143 17512 ...
##  $ Price.log            : num  3.43 3.39 3.43 3.44 3.44 ...
##  $ Mileage.year.log     : num  4.14 4.23 4.18 4.08 4.24 ...
##  $ Fuel.group           : chr  "Gasoline" "Diesel" "Diesel" "Gasoline" ...

Using this subset mileage shows a prettier distribution: it’s slightly leptokurtic, but seems to be fairly close to normal distribution as the dashed curve shows.

Age

From Manufactured.year column we can construct the Age.year variable, that is a reasonable proxy to the real age of the car (it might miss the real age by maximum about half year in either direction as the data was downloaded slightly after mid-2017).

Unfortunatelly only a small number of observations contain the month as well, so there’s no way of defining age more precisely.

Count of values of Manufactured.month:

##   10   11   12 NA's 
##  188  156  182 2026

The distribution is slightly negatively skewed with a noticable peak at the mode of 13 years and a smaller secondary peak around 5-6 years. The median is 10 years (slightly above the mean of 9.5 years), with the quartiles being 5 and 13 years.

Plotting the density of theoretical normal distribution with the same mean and standard error reveals the slight differences.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   11.00   10.43   13.00   22.00

Mileage/year

A potential metric for how intensively was a car used before can be the milage per year (meaning how many kilometers did the car travel on average per year during its lifetime). The higher the number, the more active was the usage.

This calculation have to exclude cars made in 2017 as their age in years is 0, therefore I updated the undamaged_unused_cars subset with a condition of Age.year > 0 (this is not disruptive change as it filters out only 8 observations on top of the existing conditions).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   817.6 13000.0 16540.0 19594.0 23505.1 89087.5

This constructed variable is again positively skewed, the tail reaches above 7500km/year while the mode is around 13-14 000km/year. With a log10() transformation the distribution gets closer to normal.

Form variants

There seven different available format in the dataset, of which combi and hatchback are the two most frequent, while the cabrios, notchbacks and sports are quite rare.

## 
##    Cabrio     Combi Hatchback   Minivan Notchback     Sedan     Sport 
##        18      1025       727       210         6       223         2

Transmission, fuel, A/C, number of doors, documents

In the dataset manual transmission and manual air-conditioning is by far the dominant, just as 5-door variant over 3 or 4-doors types.

In terms of fuel the two common options (gasoline and diesel) dominate over mixed and alternative fuels (ethanol and electric drive). For each plot I excluded observations where the data was missing.

Some of the cars does not have proper legal papers for domestic usage (eg, they are imported or damaged, etc.). Most of the cars have proper valid documents but missing them could be a negative factor in pricing for the rest.

## 
##             No documents        Overdue, domestic         Overdue, foreign 
##                        2                       35                       43 
## Temp.whitdrawn, domestic  Temp.whitdrawn, foreign          Valid, domestic 
##                       45                       27                     1978 
##           Valid, foreign 
##                       81

Colors

In 2069 cases we have data on the paint of the cars, not just the color but in some cases also on the shade (light/dark) and whether the car has a metallic or regular polish.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 23 variables of 2552 used cars. The continous variables are Manufactured.year and Mileage.km, while other numerical columns act more like ordinal variables (eg. engine capacity, no. of doors, no. of gears, weight, etc.) as these are determined by the variant of the model.

We also have a handful of interesting categorical variables, like the color and shade of the cars, the form type (‘combi’, ‘hatchback’, etc.).

The categorical variables somewhat contradict my initial assumption that the higher number of possible variations of the model might create a diverse dataset. Many of the categorical columns (eg. color, transmission type, fuel, etc.) are heavily concentrated around one or few typical values, though ‘exotic’ variants are also present in the data.

The most popular colors are silver, blue and gray, but if you think of silver as one kind of gray, gray is the most frequent color on the market by far. On the other end of the scale there are handful of ‘exotic’ colors, like champagne, gold and magenta.

What is/are the main feature(s) of interest in your dataset?

The price is our target variable. The distribution is not surprising, shows the natural traits of a monetary type variable, that can be treated by using log10 transformation on it. That brings it closer to a normal(ish) distribution and hopefully it helps reveal clear relationships with other variables.

There are also two interesting continous variables that might play important role indetermining the price of the car: age and mileage. None of these show extraordinary distributions, though they are not perfectly normal either, but the most common transformation do not provide meaningful benefits in these cases.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Besides age and mileage I expect some of the technical and aesthetical features to have some impact and to show relationship with the other variables, namely:

  • Form factor
  • Fuel type
  • Transmission type
  • Type and validity of documents
  • Color and polish

In a linear model we can use these via ‘dummy’ variables to gauge whether differing from the most common values have significant impact on pricing.

Did you create any new variables from existing variables in the dataset?

  1. I derived the age (counted in years) from the year when the car was manufactured. This does not alter the meaning of the data much, but it’s more reasonable to use it when jitter is added (as age can have fractional values, while the year of manufacturing can’t), even though it’s just a proxy for the real age.

  2. I created a variable for the average annual mileage (Mileage.km/Age.year) that might help explaining the difference in price of cars with the same age. It shows that how actively the car was used during its lifetime.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The initial look at prices and age surfaced two issues with the data:

1.it contains observations of damaged cars, that might have completely different profile than the undamaged ones 2. it contains cars that have only 1 or very few kilometers registered, so these are either new or mislabeled entries.

I removed both groups in the undamaged_used_cars subset without dramatically reducing the size of the sample. By removing these factors we can focus on the other important features in the dataset.

## Total number of cars:  2552 
## of which undamaged and used:  2211

Bivariate Plots Section

Quick glance on pairplots of the continous variables, using the log of Price.EUR and not transforming the others, we see a very high negative correlation with Age (-0.9) and it materializes in a clear pattern on the point plot also.

We have some strong positive and negative correlations besides the age-price relationship.

Total mileage (Mileage.km) also relatively high and even the constructed average annual mileage (Mileage.year) has detectable correlation with price. This correlation of Mileage.yearis lower when I use the log-transformed Mileage.year.

Plotting mileage and the log-transformed price for all cars (including damaged and unused ones) reveal a negative relationship between the two variables, supporting the intuitive idea that a car that has run more kilometers should be cheaper.

On the other hand the pattern is quite dispersed, the connection between the two variables does not seem to be strong (the correlation was 0.48), even when setting lower alpha to handle overplotting.

When we use only the undamaged, used cars from of the dataset the picture is better, but the relationship is still not very strong in this form, the correlation coefficient is slightly higher than -0.6.

## 
##  Pearson's product-moment correlation
## 
## data:  undamaged_used_cars$Mileage.km and undamaged_used_cars$Price.EUR
## t = -34.775, df = 2209, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6210704 -0.5671587
## sample estimates:
##        cor 
## -0.5947829

As age was calculated from the Manufactured.year column, it has discrete values. This and the overplotting makes it hard to assess the relationship (we already know that the correlation is strong).

Adding some jitter and alpha, the picture gets clearer, though there are some seemingly outlier points in both direction, and the plot seems too dense around 0 kilometer.

## 
##  Pearson's product-moment correlation
## 
## data:  undamaged_used_cars$Age.year and log10(undamaged_used_cars$Price.EUR)
## t = -104.76, df = 2209, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9191070 -0.9051162
## sample estimates:
##        cor 
## -0.9123777

If we ‘paint’ barely used or damaged cars (red), some of the outliers and much of the crowdedness around 0 pops out. It reinforces the earlier idea to exclude these cars from the analysis.

It seems we find a rather strong relationship between the age and the price of a car, which is not surprising, but good to see it proved.

Plotting mileage against age we get a rather dispersed pattern. The median mileage in the different age cohorts does not change in a linear way with age, the median mileage is roughly equal in ages between 8 and 14 years, while it increases substantiall from age 1 to 8.

It seems that the younger cars are more diverse in their annual average mileage (points are spread in a wider range). This might be due to the relatively lower number of observations in lower age cohorts where a few extreme (maybe erroneous) overall mileage value can affect the yearly average more in both directions.

In the higher age segments the distribution seems more consistent, the inter- quartile width is quite stable, while all the quartiles seem to follow a rather linear negative line.

Plotting the annual average mileage and the price (both on log-scales) does not reveal a very specific direct relationship.

## 
##  Pearson's product-moment correlation
## 
## data:  log10(undamaged_used_cars$Mileage.year) and log10(undamaged_used_cars$Price.EUR)
## t = 16.963, df = 2209, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3020638 0.3758436
## sample estimates:
##       cor 
## 0.3394758

There is some sort of relationship between the type of fuel the car uses and the age or the price, though above we already saw that alternative fuel types (not diesel or gasoline) are quite rare in the dataset.

Also cars with automatic transmission tend to be somewhat younger and a little pricier, based on the boxplots (though the mean age and mean price is roughly the same in both groups).

Engine capacity vs. Horsepower produces a great deal of overplotting. Capacity here acts like an ordinal feature, therefore the points form horizontal lines with a few outliers (138 cm3 is clearly not true, while 350hp might be the result of some tuning).

## Warning: Removed 5 rows containing non-finite values (stat_boxplot).

2D bins help to avoid overplotting and also nicely packs the slightly different engine capacity values into bins. There’s clearly a positive relationship between engince capacity and the power (not surprisingly).

Horsepower also seem to have a positive relatioship with price, though the ordinal nature of horsepower makes it hard to visualy validate this assumption.

Price distribution clearly differs by form variants.

Separating the four major varieties minivans seem to differ to a large extent from the other three distributions in its shape, while sedans seem to have a reasonably lower median price. Minivans are also very different in terms of age from the other major categories.

## # A tibble: 4 x 4
##        Form median_price mean_price standard_deviation
##      <fctr>        <dbl>      <dbl>              <dbl>
## 1     Combi       4085.0   5027.943           3298.666
## 2 Hatchback       4248.0   5463.437           4803.712
## 3   Minivan       3950.5   4058.290           1018.962
## 4     Sedan       2910.0   3898.260           2494.989

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The major finding in this section is that the relationship between age and price seem to be a very strong one, especially when we filter out the damaged or unused cars from the sample. Most likely age is the major factor in the pricing of the cars.

Also the mileage turned out to be much less important than I initially thought at least directly plotted with the price, and in this aspect, the calculated Mileage.year variable did not show important relation to the price either.

In the categorical features cars running on gasoline and with manual transmission tend to be older and somewhat cheaper.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Age and mileage are much less connected than I expected, many cars in the middle of the age range show similar milage and in a single age cohort mileage can vary in a wide range.

Also it turned out that the Minivan variants have a very different age distribution than the other major categories. Some background to that: the mini- van version was initially called Ford Focus C-Max but from 2010 it is a separate model, not a Focus variant. This explains this phenomenon (the dataset contains a boolean feature for C-max, which is equivalent of Form being Minivan).

What was the strongest relationship you found?

Clearly the relationship between age and (log) price, the older the car, the smaller the price the seller asks for it.

Multivariate Plots Section

Electric cars are clearly amongst the youngest and priciest ones, while diesel cars in the lower age segments tend to be on the cheaper end compared to gasoline cars with the same age.

We already know that silver, gray and blue combis and hatchbacks are the most common variations, but from the plot on the right it seems that buyers are better off buying violet or gold, maybe turquoise cars.

Not surprisingly the youngest and least used cars have licenses with the longest validity, but also interesting to see, that for almost all age above 3-4 the validity dates are disperesed in roughly the same range.

In lower age segments cars with higher average mileage tend to have lower price.

Minivans seem to differ not only in age and price distribution, the visible relationship between these to features is also visibly different (the slope of the regressionline is slightly smaller).

Colors do not seem to have a huge impact on age and price distribution based on this facet grid.

Minivans are different when we split price distributions by fuel type.

A nice way to visualize the relationship between age, mileage and price is to use the colors for the years. We get a nice rainbow-like pattern showing that the price of cars with the same age tend to group around the same values.

Also the color ‘bands’ have a slight negative slopes showing that cars with the same age but higher mileage tend to have lower price.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I think most of the important relationships were already revealed in the bivariate section, but the multivariate plots mostly underpinned these findings.

Clearly the most important is the role of age in the pricing (the older the car lower the price the seller can ask for it). Also it became clear that minivans differ from the other frequent variants in many aspects (pricing and age, most notably).

Were there any interesting or surprising interactions between features?

One interesting thing is that even though the average annual mileage in itself does not show strong relationship with price, in conjunction with age it might still explain a smaller part of the variance in prices, as from cars with same age the one which has higher annual mileage tend to have lower price tag.

One new observation is that among the younger cars the different type of fuels might have an impact on prices.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a linear modell from some of the features of the dataset to model the pricing of the cars. The target variable is the log10 of Price.EUR.

The R-squared value is slightly over 90, which leaves some room for improvement, still, it shows that the above mentioned variables do have significant impact in determining the price of the cars.

The most important variable is Age.year, but I also included the log transformed Mileage.year and also total Mileage.km, and added Horsepower and some dummy variables of form variety, fuel, air-conditioning and license validity.

One of the limitations I encountered was that adding color and polish into the model tends to improve R-squared but with the trade-off of losing few hundred observations due to missing color information. This improvement comes at a great cost and threatens with overfitting (creating a model that works perfectly on a certain small sample, but loses predictive power on bigger populations).

Also I assume that extracting more data from the listings (eg. text fields and descriptions) might help to improve the model with additional data that explains a little more of the variance of the prices.

## 
## Call:
## lm(formula = log10(undamaged_used_cars$Price.EUR) ~ undamaged_used_cars$Age.year + 
##     log10(undamaged_used_cars$Mileage.year) + undamaged_used_cars$Mileage.km + 
##     factor(undamaged_used_cars$Fuel == "Diesel") + factor(undamaged_used_cars$Fuel == 
##     "Gasoline") + factor(undamaged_used_cars$Form == "Minivan") + 
##     factor(undamaged_used_cars$Documents == "Valid, domestic") + 
##     factor(undamaged_used_cars$A.C.type == "Manual A/C") + undamaged_used_cars$Horsepower)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49218 -0.04004  0.00302  0.04688  0.46866 
## 
## Coefficients:
##                                                                  Estimate
## (Intercept)                                                     4.297e+00
## undamaged_used_cars$Age.year                                   -5.052e-02
## log10(undamaged_used_cars$Mileage.year)                        -5.589e-02
## undamaged_used_cars$Mileage.km                                 -6.392e-07
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE               -6.400e-02
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE             -4.912e-02
## factor(undamaged_used_cars$Form == "Minivan")TRUE               5.616e-02
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE  6.748e-02
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE       -1.721e-02
## undamaged_used_cars$Horsepower                                  1.816e-03
##                                                                Std. Error
## (Intercept)                                                     8.051e-02
## undamaged_used_cars$Age.year                                    9.964e-04
## log10(undamaged_used_cars$Mileage.year)                         1.852e-02
## undamaged_used_cars$Mileage.km                                  5.590e-08
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE                1.798e-02
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE              1.792e-02
## factor(undamaged_used_cars$Form == "Minivan")TRUE               6.229e-03
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE  5.869e-03
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE        4.222e-03
## undamaged_used_cars$Horsepower                                  8.689e-05
##                                                                t value
## (Intercept)                                                     53.373
## undamaged_used_cars$Age.year                                   -50.697
## log10(undamaged_used_cars$Mileage.year)                         -3.018
## undamaged_used_cars$Mileage.km                                 -11.434
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE                -3.560
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE              -2.740
## factor(undamaged_used_cars$Form == "Minivan")TRUE                9.016
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE  11.497
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE        -4.076
## undamaged_used_cars$Horsepower                                  20.897
##                                                                Pr(>|t|)
## (Intercept)                                                     < 2e-16
## undamaged_used_cars$Age.year                                    < 2e-16
## log10(undamaged_used_cars$Mileage.year)                        0.002575
## undamaged_used_cars$Mileage.km                                  < 2e-16
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE               0.000379
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE             0.006194
## factor(undamaged_used_cars$Form == "Minivan")TRUE               < 2e-16
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE  < 2e-16
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE       4.75e-05
## undamaged_used_cars$Horsepower                                  < 2e-16
##                                                                   
## (Intercept)                                                    ***
## undamaged_used_cars$Age.year                                   ***
## log10(undamaged_used_cars$Mileage.year)                        ** 
## undamaged_used_cars$Mileage.km                                 ***
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE               ***
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE             ** 
## factor(undamaged_used_cars$Form == "Minivan")TRUE              ***
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE ***
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE       ***
## undamaged_used_cars$Horsepower                                 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08195 on 2117 degrees of freedom
##   (84 observations deleted due to missingness)
## Multiple R-squared:  0.906,  Adjusted R-squared:  0.9056 
## F-statistic:  2267 on 9 and 2117 DF,  p-value: < 2.2e-16

Final Plots and Summary

Plot One

Description One

I particularly like this plot because it packs two of the most important relationships in this analysis into one graph:

  1. Age has a negative impact on price, the points are rather close to the regression line with negative slope.

  2. In any age group the cars with lower average milage tend to have higher prices. The points above the regression line seem to be darker in general, but this gets even sharper in the lower half of the date range.

I prefer this plot over using boxplot in this particular case because boxplot partially masks the shape of the underlying distribution.

Plot Two

Description Two

I like this plot because it clearly visualizes some differences between the main form varieties and fuel types. This plot helped me to recognize how different is the minivan from the others, which in turn helped me to learn the special history of the C-max variant (starting as a variant, later becoming a separate model).

  1. The minivan has a much narrower price distribution, the cars are more tightly grouped around they median and mean (the distribution is pretty symmetrical too).

2.This is the only variant where there are only diesel or gasoline fueled cars in the sample (no alternative fuels)

For the other models gasoline cars tend to have slightly (or even substantially) lower median price than diesel ones.

Plot Three

Description Three

Finally a visual proof that the color of the car has no meaningful impact on the price the seller asks, neither if the car has one of the most frequent colors nor if it’s one of the exotic variants. All the plots show very similar distribution, except where we have too few observation, and exepct the minivans, that we already saw somewhat differs from the other categories.

No surprise that none of the variables in connection with the color was included in the linear model above.

Aside from this I like how the facet_grid() let’s me visualize at least 4 variables, not counting the aesthetics of the points, like size, color and shape.


Reflection

It was exciting to explore a very recent, up-to-date, real-life dataset that I personally collected. The dataset also reinforced some of my initial assumptions (eg. the age must be an important factor) but also falsified a few (like the major importance of mileage in the pricing) and answered my questions, like does the color of the car matter in the price.

The custom dataset was not as clean and well balanced between different features as many of the usual learning datasets (eg. some of the features turned out to be concentrated around one or few values). Therefore I had to go back to the cleaning script several times, when something popped out in first part of this analysis.

I plan to work on with this dataset, and also I would like to broaden it, in two ways:

  1. Get more data of each listing, eg. qualitative technical and convenience features from text fields, geolocation (where is the seller/car located), etc. These might help to improve the predicative power of the model.

  2. Include some other popoular car brands and models in the sample to have a significantly larger dataset, and to see how different (or similar) are the factors and their weights in determining the price of each car.

I also believe that by analysing the longer descriptions that the sellers providing with NLP tools and machine learning might reveal some more important patterns, and in conjunction with the other data it could even reveal some clues about the genuineness of the listings, and help filter out scams and fraudulent sellers.